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Abstract. We use a graphics processing unit (GPU) for fast calculations of helicity amplitudes of quark 
and gluon scattering processes in massless QCD. New HEGET (HELAS Evaluation with GPU Enhanced 
Technology) codes for gluon self-interactions are introduced, and a C++ program to convert the MadGraph 
generated FORTRAN codes into HEGET codes in CUDA (a C-platform for general purpose computing 
on GPU) is created. Because of the proliferation of the number of Feynman diagrams and the number 
of independent color amplitudes, the maximum number of final state jets we can evaluate on a GPU is 
limited to 4 for pure gluon processes {gg — >4<j), or 5 for processes with one or more quark lines such as 
g<7— >5(? and qq — > qq + 3g. Compared with the usual CPU-based programs, we obtain 60-100 times better 
performance on the GPU, except for 5-jet production processes and the gg — > 4p processes for which the 
GPU gain over the CPU is about 20. 



1 Introduction 

In our previous report [1] we introduced a C-language [2] 
version of the HELAS codes [3], HEGET (HELAS Evalua- 
tion with GPU Enhanced Technology) , which can be used 
to compute helicity amplitudes on a GPU (Graphics Pro- 
cessing Unit). Encouraging results with 40-150 times faster 
computation speed over the CPU performance were ob- 
tained for pure QED processes, qq—m,'-/, for n = 2 to 8 in 
pp collisions. 

In this paper, we extend our study to QCD processes 
with massless quarks and gluons. The HEGET routines 
for massless quarks and gluons are identical to those of 
quarks and photons introduced in p], and the qqg vertex 
function structure is also the same as the qqj functions. 
The only new additional routines are those for ggg and 
9999 vertices. For the QED processes studied in ref. pQ, 
we found that the present CUDA compiler cannot process 
qq— >67 amplitude with 6!~700 Feynman diagrams, and 
we need to subdivide the HEGET codes into small pieces 
for 67 and 77 processes. In the case of 87 production with 
8! ~ 4 x 10 4 Feynman diagrams, we have not been able 
to compile the program even after subdivision into small 
pieces. We also encountered serious slow down when the 
program accesses global memory during the parallel pro- 
cessing period. Therefore, our concern for evaluating the 
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QCD processes on a GPU is the proliferation of the num- 
ber of diagrams, as well as the number of independent 
color amplitudes which come with different color weights. 

The paper is organized as follows. In section 2, we 
present the cross section formula for n-jet production pro- 
cesses in pp collisions in the quark-parton model, or in the 
leading order of perturbative QCD with scale-dependent 
parton distribution functions (PDF's). In section 3, we 
review briefly the structure of GPU computing by using 
HEGET codes, and give basic parameters of the GPU and 
CPU machines used in this analysis. In section 4, we in- 
troduce new HEGET functions for ggg and gggg vertices. 
Section 5 gives our results and section 6 summarizes our 
findings. Appendix lists all the new HEGET codes intro- 
duced in section 4. 



2 Physics Process 

2.1 n-jet production in pp collisions 

The cross section for n-jet production processes can be 
expressed as 

do- = ^ dx a dx b D a / p (x a ,Q) D b/p (x b ,Q)da(s) , (1) 

{a,b} 

where D a / p and D b / p are the scale (Q) dependent parton 
distribution functions (PDF's), x a and x b are the momen- 
tum fractions of the partons a and b, respectively, in the 
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Table 1. The number of Feynman diagrams and the color bases for QCD processes studied in this paper. 



No. of jets 
in the final state 


SS^gluons 


uu — * gluons 


uu —* uu+ gluons 


^diagrams 


^colors 


^diagrams 


^colors 


^diagrams 


#colors 


2 


6 


6 


3 


2 


2 


2 


3 


45 


24 


18 


6 


10 


8 


4 


510 


120 


159 


24 


76 


40 


5 


7245 


720 


1890 


120 


786 


240 



right- and left-moving protons. For the total pp collision 
energy of y^s, 



S X a X b , 



(2) 



gives the invariant mass squared of the hard collision pro- 
cess 

a+b-vl + 2+-+n. (3) 

The subprocess cross section is computed in the leading 
order as 



da(s) 



1 1 



2s 2-2 n a n b 



where 



d$ n = (2^)V ( Pa +p b -j2pi) n 



\ (2tt)3 2uj t ' 



(5) 



is the invariant n-body phase space, Xi are the helicities of 
the initial and hnal partons, n a and n b are the color degree 
of freedom of the initial partons, a and 6, respectively, 
and Ci represents the color indices of the initial and final 
partons. When there are more than one gluons or identical 
quarks in the final states, an appropriate statistical factor 
should be multiplied on the phase space d$ n in eq. (J5]). 
The Helicity amplitudes for the process |TJ) 



a(p Q , A a , Co) + b(p b , X b , c;,) 
-> l(pi,Ai,ci)H hn(p„,A„,c„) 



can be expressed as 



E 

Z£ diagram 



(6) 



(7) 



where the summation is over all the Feynman diagrams. 
The subscripts Xi stand for a given combination of helici- 
ties (±1 for both quarks and gluons in the HELAS conven- 
tion 3|)j and the subscripts c, correspond to a set of color 
indices (1, 2, 3 for flowing-IN quarks, 1,2,3 for flowing- 
OUT quarks, and 1 to 8 for gluons). In MadGraph g] the 
amplitudes are expanded as 



(8) 



in the color bases Tf* which are made from the SU(3) 
generators in the fundamental representation [5] 



The color factors are computed as 
1 



Mat 



n a n b 



(9) 



where n a . b = 3 for q and q, n a , b = 8 for gluons, and the 
summation is over all {qj = {c , c b , ci, . . . , On}. The color 
sum-averaged square amplitudes are computed as 



E l^| 2 = E( J ^)-^( J ^- 



The cross sections are then expressed as 
1 



(10) 



(11) 



where we introduce the helicity sum-average symbol as 

(12) 



O 9 



In this paper the following three types of multi-jet pro- 
duction processes are computed: 



99 99, 999, 9999 

<M 99, 999, 9999, 99999 
uu — * uu, uug, uugg, uuggg 



(13a) 
(13b) 
(13c) 



The number of contributing Feynman diagrams and the 
number of color bases for the above processes are sum- 
marized in Table [U which includes those for the process, 
gg— >5g. We note here that the number of diagrams (7245) 
for gg — > 5g exceeds that of the ml— > 7j process (7! « 5040) , 
for which we could run the converted MadGraph codes on 
a GPU, only after division into small pieces [I!. In fact, 
we have not been able to run the gg — > 5g program on 
GPU even after dividing the program into more than 100 
pieces; as explained in section f5. 41 

Proliferation of the number of independent color basis 
vectors is also a serious concern for GPU computing, since 
the color matrix Af of eq. ([9]) has m(m + l)/2 elements 
when there are m independent basis vectors T£\ For ex- 
ample, the process uu — > uuggg has m = 240 color basis 
vectors from Table[T] and the matrix has 3xl0 4 elements. 
The matrix exceeding 16000 elements cannot be stored in 
the 64kB constant memory, while storing it in the global 
memory will result in serious loss of efficiency in parallel 
computing. Therefore, the method to handle summation 
over color degrees of freedom is a serious concern in GPU 
computing. 
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2.2 Selection criteria for jets 

Total and differential cross sections of the processes (| 1 3 j) 
in pp collisions at i/s = 14TeV are computed in this paper. 
We introduce final state cuts for all the jets as follows: 

\ Vl \ <rf ut =2.h, (14a) 
PTi > Pt* = 20 GeV, (14b) 
PTij >PT Ut = 20GeV, (14c) 

where rji and p^i are the rapidity and the transverse mo- 
mentum of the i-th jet, respectively, in the pp collisions 
rest frame along the right-moving (p z = \p\) proton mo- 
mentum direction, and PTij is the relative transverse mo- 
mentum [6] between the jets i and j defined by 

p Tij = min(pri,PTj) ARij, (15a) 
ARij = ^A v ? 3 +Atf 3 . (15b) 

Here ARij measures the boost-invariant angular separa- 
tion between the jets. 

As for the parton distribution function (PDF), we use 
the set CTEQ6L1 7 and the factorization scale is chosen 
to be the cut-off p T value, Q = p T ut = 20 GeV. The QCD 
coupling constant is also fixed as 

a s = a s {Q = 20GeV) M§ = 0.171, (16) 

which is obtained from the MS coupling at Q — mz, 
"s(™z)ms = 0.118 [5 by using the NLO renormalization 
group equations with 5-flavors. 

3 Computation on the GPU 

3.1 GPU and its host PC 

For the computation of the cross sections of QCD n-jet 
production processes we use the same GPU and host PC as 
in the previous report [Tj . In particular we use a GeForce 
GTX280 by NVIDIA HJ with 240 processors, whose pa- 
rameters are summarized in Table [2j It is controlled by a 
Linux PC with Fedora 8 on a CPU whose properties are 
summarized in Table [31 

Programs which are used for the computation of the 
cross sections are developed with the CUDA [2] environ- 
ment introduced by NVIDIA [TT] for general purpose GPU 
computing. 

3.2 Program structure 

Our program computes the total cross sections and dis- 
tributions of the QCD n-jet production processes via the 
following procedure: 

1. initialization of the program, 

2. random number generation for multiple phase-space 
points {p a ,Pb,Pi, ■ ■ ■ iPn} and helicities {A,} on the 
CPU, 



Table 2. Parameters of GeForce GTX280. 



Number of 


30 


multiprocessor 




Number of core 


240 


Total amount of 


1000 


global memory [MB] 




Total amount of 


64 


constant memory [kB] 




Total amount of shared 


16 


memory per block [kB] 




Total number of registers 


16 


available per block [kB] 




Clock rate [GHz] 


1.30 



Table 3. Host PC environment. 



CPU 


Core2Duo 3GHz 


L2 Cache 


6MB 


Memory 


4GB 


Bus Speed 


1.333GHz 


OS 


Fedora 8 (64 bit) 



3. transfer of random numbers to the GPU, 

4. generation of helicities and momenta of initial and fi- 
nal partons using random numbers, and compute am- 
plitudes (J\i) a of eq. ([5]) for all the color bases on the 
GPU, 

5. multiplying the amplitudes and their complex conju- 
gate with the color matrix Af a b of eq. © and summing 
them up as in eq. (|10p . and multiply the PDF's of the 
incoming partons on the GPU, 

6. transferring momenta and helicities for external parti- 
cles, computed weights and the color summed squared 
amplitudes to the CPU, and 

7. summing up all values to obtain the total cross section 
and distributions on the CPU. 

Program steps between the generation of random num- 
bers (2) and the summation of computed cross sections (7) 
are repeated until we obtain sufficient statistics for the 
cross section and all distributions. 



3.3 Color matrix calculation 

In order to compute the cross sections of the QCD multi- 
jet production processes, multiplications of the large color 
matrix M a b of eq. ©, the vector of color-bases amplitudes 
{J\i) a of eq. and its complex conjugate have to be 
performed, as in eq. (fTU|) . For large n-jet processes, like 
gg — > 4 gluons, uu~ > 5 gluons and uu — ► uu + 3 gluons, the 
dimensions of color matrices exceed 100, and the number 
of multiplication becomes larger than 10 4 . These matrices 
cannot be stored in the constant memory (64kB for the 
GTX280; see Tabled]) which is accessed in parallel, while 
storing them in the global memory (1GB for GTX280) 
results in serious slow-down of the GPU. We find that 
multiplications for the color-summation in eq. (|10p can be 
reduced significantly as follows. 



4 
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Table 4. Number of different non-zero elements in the color 
matrix of eq. ©. 



No. of jets 


#3 ^gluons 


uu — > gluons 


uu — > uu+ gluons 


2 


3 


2 


2 


3 


7 


4 


7 


4 


15 


9 


19 


5 


45 


24 


60 



The color matrix of cq. ([9]) contains many elements 
with the same value. We count the number of different 
non-zero elements in the color matrix and find the results 
shown in Table 2] We find for instance that among the 
240x (240+l)/2 = 28, 720 elements of the color matrix for 
the uu^uu + 3g process, there are only 60 unique ones. 

In general, the number of different elements in the 
color matrix grows linearly rather than quadratically as 
the number of color basis vectors grows. Since the num- 
bers in Table [4] are small enough, we can store them in 
the constant memory which is accessed quickly by each 
parallel processor. 

Before we arrive at the above solution adopted in this 
study, we examined the possibility of summing over colors 
via Monte Carlo. Let us briefly report, in passing, on this 
exercise. 

In the Monte Carlo color summation approach, we 
evaluate the matrix element A4 C ^. (O for a given set of 
momenta {pi}, helicities {A^} and colors {ci}, and sum 
the squared amplitudes over randomly generated sets of 
{pi, Xi, Ci}. This method turns out not to be efficient be- 
cause in the color basis using the fundamental representa- 
tion of the SU(3) generators adopted by MadGraph, most 
of the basis vectors T£' vanish for a given color configura- 
tion {ci}. As an example, gg — ► Ag has 5! = 120 color basis 
vectors (see Table [lj , which take the form 



T Ci = Tr ^p a ^ r j ia ^ . . . r p a &^ 



(17) 



for the configuration {c{\ = (ai, a^, ■ ■ ■ , tie) where Oj de- 
notes the color index of the gluon i taking an integer value 
between 1 and 8. Among the 8 6 «260,000 configurations, 
only 12% give non-zero values. Moreover, as many as 75% 
of the color configurations give vanishing results for all the 
120 basis vectors. Although the efficiency can be improved 
by changing the color basis, we find that our solution of 
evaluating the exact summation over colors is superior to 
the Monte Carlo summation method for all the processes 
which we report in this paper. 



4 New HEGET functions 

The HEGET functions for massless quarks and gluons are 
the same as those introduced in the previous report pQ. 
The qqg vertex functions are identical to the qq^ functions 
of ref. [1] except for the coupling constant; 



for the vertex 

C qqg = -gJ%Al(x)Tn(x)vM(x) 



(19) 



where g s — \J Akol 8 is the strong coupling constant and T a -. 
is an SU(3)) generator in the fundamental representation. 
For example, the qqg vertex function is computed by the 
HEGET function iovxxO as 

iovxxO (cmplx* fi, cmplx* fo, cmplx* vc , float g, 

cmplx vertex) 

(20) 

where the coupling constants are 

g[0] = g[l] = g s (21) 

following the convention of MadGraph [J] and the color 
amplitude is 1 

- T% (vertex) . (22) 

In the rest of this section, we introduce new HEGET 
functions for three- vector boson (VVV) and four- vector bo- 
son (WW) vertices. All the new HEGET functions are 
listed in Tabled and their contents are given in Appendix. 
Also shown in Table [5] is the correspondence between the 
HEGET functions and the HELAS subroutines [3]. 

4.1 VVV: three vector boson vertex 

For the ggg vertex 

£ gag = gsf abc (d^(x))Al(x)At(x) (23) 

we introduce two HEGET functions, vvvxxx and jvvxxO. 
They correspond to HELAS subroutines, VVVXXX, and 
JVVXXX, respectively, for massless particles; see Table O 

4.1.1 vvvxxx 

The HEGET function vvvxxx (List Q] in Appendix) com- 
putes the amplitude of the VVV vertex from vector boson 
wave functions, whether they arc on-shcll or off-shell. The 
function has the arguments: 

vvvxxx (cmplx* ga, cmplx* gb, cmplx* gc, 
float g, cmplx vertex) 

where the inputs and the outputs are: 

Inputs: 

cmplx ga [6] wavefunction of gluon with color 
index, a 

cmplx gb [6] wavefunction of gluon with color 
index, b 

cmplx gc [6] wavefunction of gluon with color (25) 
index, c 

float g coupling constant of VVV vertex 

Outputs: 

cmplx vertex amplitude of the VVV vertex 



(24) 



eQ q 



r P a L 



(18) 



1 The sign of the color amplitudes (|22[) and (|27[) follows the 
sign of the Lagrangian terms (|19[) and (|23l) . respectively. Mad- 
Graph [3] adopts the Lagrangian with the opposite sign, that 
is, (5s)MadGraph = ~9s - This sign difference is absorbed by the 
conventions (l2"2l) and (|27"1). 
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Table 5. List of the new vertex functions in HEGET. 



Vertex 


Inputs 


Output 


HEGET Function 


HELAS Subroutine 


VVV 


VVV 


Amplitude 


vvvxxx 


VVVXXX 




vv 


V 


j vvxxO 


JVVXXX 


WW 


GGGG 


Amplitude 


ggggxx 


GGGGXX 




GGG 


G 


JgggxO 


JGGGXX 



The coupling constant is 

g - 9s (26) 

in the HEGET function (f2"4")l . following the convention of 
MadGraph [4]. In order to reproduce the amplitudes as- 
sociated with the ggg vertex Lagrangian of eq. l|23p . the 
color factor associated with the ggg vertex is if abc . More 
explicitly, the vertex amplitude for eq. ([23]) is 1 



if abc (vertex) 



(27) 



by using the output (vertex) in eq. (j24|) . Also note the 
HELAS convention [5] of using the flowing-OUT momenta 
and quantum numbers for all bosons. 

4.1.2 jvvxxO 

This HEGET function jvvxxO (List [2] in Appendix) com- 
putes the off-shell vector wavefunction from the three- 
point gauge boson coupling in eq. (|23|) . The vector propa- 
gator is given in the Feynman gauge for a massless vector 
bosons like gluons. It has the arguments: 



(28) 



jvvxxO (cmplx* ga, cmplx* gb, float g, 

cmplx* jw) 

where the inputs and the outputs are: 

Inputs: 

cmplx ga [6] wavefunction of gluon with color 
index, a 

cmplx gb [6] wavefunction of gluon with color 
index, b 

float g coupling constant of the VVV vertex 



Outputs: 

cmplx jvv[6] vector current j^(gc:ga, gb) which has 
a color index, c 

(29) 

As in eq. (|27|) the color amplitude for the off-shell current 
is 

*f abc (jw) . (30) 



4 -2.1 ggggxx 

The HEGET function ggggxx (List [3] in Appendix) com- 
putes the portion of the amplitude of the gggg amplitude 
where the first and the third, and hence also, the sec- 
ond and the fourth gluon wave functions are contracted, 
whether the gluons are on-shell or off-shell. The function 
has the arguments: 



ggggxx (cmplx* ga, cmplx* gb, cmplx* gc, 
cmplx* gd, float g, cmplx vertex) 

where the inputs and the outputs are: 



(32) 



Inputs: 
cmplx ga [6] 

cmplx gb [6] 

cmplx gc [6] 

cmplx gd [6] 

float gg 



wavefunction of gluon with color 
index, a 

wavefunction of gluon with color 
index, b 

wavefunction of gluon with color 
index, c 

wavefunction of gluon with color 
index, d 

coupling constant of VVV vertex 



(33) 



Outputs: 

cmplx vertex amplitude of the WW vertex 
The coupling constant gg for the gggg vertex is 
gg = 9s ■ 



(34) 



In order to obtain the complete amplitude, the function 
must be called three times (once for each color structure) 
with the following permutations: 

gggg xx (ga,gb,gc,gd,gg,vl) (35a) 

ggggxx(ga,gc,gd,gb, gg,v2) (35b) 

gggg xx (g a , g d , gb, g c , gg, v3) (35c) 

The color amplitudes are then expressed as 

jabejede ^ + jacejdbe ^ + jade jbee ^3) _ ( 36 ) 



4.2 WW: four vector boson vertex 

For the ggggg vertex 

£«™ = -£f abe f cde A a »(x)A»»(x)Al(x)At(x) (31) 

we introduce two HEGET functions, ggggxx and jgggxO, 
listed in Tabled They correspond to HELAS subroutines, 
GGGGXX and JGGGXX, respectively, for massless particles. 



4 -2.2 jgggxO 

The HEGET function j gggxO (List |4] in Appendix) com- 
putes an off-shell gluon current from the four-point gluon 
coupling, including the gluon propagator in the Feynman 
gauge. It has the arguments: 

jgggxO (cmplx* ga, cmplx* gb, cmplx* gc , float gg, 

cmplx* jggg) 

(37) 



6 
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where the inputs and the outputs are: 
Inputs: 

cmplx ga [6] wavefunction of gluon with color 
index, a 

cmplx gb [6] wavefunction of gluon with color 
index, b 

cmplx gc [6] wavefunction of gluon with color 
index, c 

float gg coupling constants of the WW vertex 

Outputs: 

cmplx jggg[6] vector current j M (gd:ga, gb, gc) which 
has a color index, d 

(38) 

The function (f3"T)) computes off-shell gluon wave function 
with three specific color index d which comes along with 
a specific color factor. As in eq. ([33)1 it should be called 
three times 

jgggxO(ga,gb,gc,gg, jl) (39a) 
jgggxO(gc,ga,gb,gg, j2) (39b) 
jgggxO(gb,gc,ga,gg, j3) (39c) 

to give the off-shell gluon with the color factor 

f abe f cde + jace f dbe (j2) + f ade f bce (j3) _ (4Q) 



5 Results 

5.1 Comparison of total cross sections 

In order to validate the new HEGET functions which are 
introduced in this report, we compare the total cross sec- 
tions of n-jet production processes computed on the GPU 
with those calculated by other programs which are based 
on the FORTRAN version of the HELAS library. We use 
MadGraph/MadEvent J] and another independent FOR- 
TRAN program which uses the Monte Carlo integration 
program, BASES [12] . as references. Due to the limited 
support for the double precision computation capabilities 
on the GPU, the whole computations with HEGET on a 
GTX280 are done with single precision, while the other 
programs with HELAS in FORTRAN compute cross sec- 
tions with double precision. 

For the calculation of the n-jet production cross sec- 
tions we use the same physics parameters as the Mad- 
Graph/MadEvent for all programs, and the same final 
state cuts of eq. dl4l) for all processes and all programs. 
The parton distribution functions of CTEQ6L1 [7J and the 
same factorization and renormalization scales, Q = p™* = 
20GeV, are also used. 

Results for the computation of the total cross sections 
are summarized in Tables [51 and [5] for gg — > gluons, 
mm— > gluons and uu — > uu-fgluons, respectively. We find the 
results obtained by the HEGET functions agree with those 
from the other programs within the statistics of generated 
number of events. 



We note that multi-jet events that satisfy the final 
state cuts of eq. (fbl)) . where all jets are in the central 
region in |n| < 2.5 (|14a[) and their transverse momentum 
about the beam direction (|14bj) and among each other 
(|14c[) greater than 20 GeV, are dominated by pure glu- 
onic processes in Table [6J The cross sections for uu ~ * ng 
process in Table UJ are small because of uu annihilation. 
We note that the crossing-related non-annihilation pro- 
cesses, ug — ► u+(n—l)g, have exactly the same number 
of diagrams and color bases, hence can be evaluated with 
essentially the same amount of computation time. 

5.2 Comparison of the processing time 

As already described in our previous report [1] , we prepare 
two versions of the programs in the same structure for the 
computation of the total cross sections. One is written in 
CUDA, a C-based language, and can be executed on the 
GPU. The other is written in C and can be executed on the 
CPU. Using a standard C library function we measure the 
time between the start of the transfer of random numbers 
to the GPU and the end of the transfer of computed results 
back to the CPU. 

In Fig. [TJ the measured process time in usee for one 
event of n-jet production processes is shown for the GPU 
(GTX280) and the CPU (Linux PC with Fedora 8). They 
are plotted against the number of jets in the final state. 
Because the process time per event on the GPU depends [I] 
strongly on the number of allocated registers at the com- 
pilation by the CUDA and the size of thread blocks at the 
execution time, we scan combination of these parameters 
for the fastest event process time on the GPU. 

The upper three lines in Fig. [T] show the event process 
times on the CPU. They correspond to gg — > n-jets de- 
noted as gg, uu —* n-jets as uu and uu — > uu + n-jets as 
uu, respectively. For processes with small numbers of jets, 
e.g. nj ct = 2, the event process times for different processes 
are all around 4.5 usee. This is probably because they are 
dominated by computation steps other than the amplitude 
calculations, such as computations of the PDF factors and 
the data transfer between GPU and CPU, which are com- 
mon to all physics processes. When the number of jets 
becomes larger, the event process time for the same num- 
ber jets in the final states is roughly proportional to the 
number of diagrams of each process listed in Table [TJ 

The lower three lines in Fig. [1] show the event process 
times on a GTX280. They also correspond to gg— > n-jets 
denoted as gg, uu — > n-jets as uu and uu — > uu + n-jets 
as uu, respectively. As the number of jets becomes larger, 
the process time on the GPU grows more rapidly than 
that on the CPU. For the nj ct = 4 case, the event process 
time of gg — > 4 gluons is larger than the expected time 
from the proportionality to the number of diagrams of the 
other processes, uu— > 4 gluons and uu^uu + 2 gluons. In 
other words, the event process time on GPU grows faster 
than what we expect from the growth of the number of 
Feynman diagrams. 

For instance, the event process times ratio for <?<?— >4<? 
and gg — > 3g on the CPU are roughly 120 /xsec/14 /xsec 



K. Hagiwara et al.: Calculation of HELAS amplitudes for QCD 7 





Table 6. Total 


cross sections for 


ffff^graons [fb]. 




No. of jets 


HEGET 


Bases 


MadGraph/MadEvent 




2 


q i Q9Q 4- n nm n 


q i Q9s 4- n nm n 


°\ 1 QD9 4 fl 0fl7fi 


x 10 11 


Q 
O 


9 (S9m 4- n nn9^ 

z>.\j£j\jL zn u.uuio 


9 ri 4 n nn^fi 

i.uiou zn u.uuou 


9 fi99i 4 n nnfii 

^..U__'.L Z1Z U.UUU1 


xlO 10 


4 


5.813 ±0.020 


5.8140 ± 0.0095 


5.776 ± 0.034 


xlO 9 




Table 7. Total 


cross sections for 


uu -^gluons [fb]. 




No. of jets 


HEGET 


Bases 


MadGraph/MadEvent 




2 


2.8981 ± 0.0007 


2.8969 ± 0.0006 


2.8991 ± 0.0073 


xlO 7 


3 


1.8420 ± 0.0012 


1.8388 ± 0.0018 


1.8421 ± 0.0077 


xlO 6 


4 


4.465 ± 0.022 


4.496 ±0.017 


4.399 ± 0.038 


xlO 5 


5 


1.566 ±0.057 


1.589 ±0.018 


1.542 ± 0.039 


xlO 5 


Table 8. Total cross sections for uu — >uu+gluons [fb]. 


No. of jets 


HEGET 


Bases 


MadGraph/MadEvent 




2 


2.6715 ± 0.0014 


2.6743 ± 0.0011 


2.6689 ± 0.0047 


xlO 8 


3 


5.897 ± 0.004 


5.889 ±0.010 


5.871 ±0.015 


xlO 7 


4 


2.7754 ± 0.0130 


2.7500 ± 0.0083 


2.748 ± 0.042 


xlO 7 


5 


1.513 ±0.024 


1.560 ±0.013 


1.513 ±0.024 


xlO 6 



~8.6, which roughly agrees with the ratio of the numbers 
of Feynman diagrams (Table [1]), 510/45^ 11. The corre- 
sponding ratio on GPU is 3.8 /isec/0.1 //sec~38, which is 
significantly larger. 

For the same number of jets, we also observe that 
the event process times on the CPU are roughly pro- 
portional to the number of diagrams. Fir nj Ct = 4, the 
ratio of the process times for gg — > Ag to uu — > 4g are 
about 120 /Ltsec/29 /isec ~ 4.1 on CPU, as compared to 
the ratio of the number of Feynman diagrams in Table [TJ 
510/159 ~ 3.2. The same applies to 7ij Ct = 5 between 
im— >5<? and uu—>uuggg, where Feynman diagrams have 
the ratio 1890/786~2.4 from Table[Q and the event pro- 
cess time on the CPU gives 300 /xsec/180 /isec~1.7, also 
in rough agreement . 

On the other hand, the event process times on the GPU 
for gg—>4g and uu—tAg have a ratio 3.8 /zsec/0.45 fisec 
~8.4 which is much larger than the ratio of the diagram 
numbers; while that for ml— >5g and uu^uuggg has the 
ratio of ll/isec/9.5/xsec ~ 1.15. Although we do not fully 
understand the above behavior of the event process time 
on the GPU, we find that they tends to scale as the prod- 
uct of the number of Feynman diagrams and the number 
of color bases, while the event process times on the CPU 
are not sensitive to the latter. This is probably because as 
the number of color bases grows, more amplitudes, (J\i) a 
in eq. ([8|) , should be stored and then called to compute the 
color sum, eq. (jTUJ) ■ These observations tell us that the rel- 
ative weight of the color matrix computation in the GPU 
computing is very significant even after identifying the in- 
dependent elements of the color matrix Af a p in eq. © as 
listed in Table H 



3 : Event Process Time on GTX280 
1 f QCD Processes 




Number of Jets in Final State 

Fig. 1. Processing time for GPU and CPU. 



5.3 Comparison of performance of GPU and CPU 

The ratios of event process times between CPU and GPU 
are shown in Fig. [2] Three lines correspond to gg— >n-jets 
denoted as gg, uu— ► n-jets as uu and uu^ uu+(n — 2)- 
jets as uu, respectively. The performance ratios exceed 
100 for the processes with small numbers of jets (rijet < 3) 
in the final state. For nj e t = 4 and 5, the performance 
ratios gradually drop to less than 40. For processes with 
large numbers of color bases, the ratios are smaller. For 
gg— +4gluons, which has 120 color bases, the ratio is about 
30, and for itw ^uu± 3 gluons, which has 240 color bases, 
the ratio becomes about 20. 

5.4 Note on gg — > 5g study 

Among five-jet production processes we have not been able 
to run the program for gg—f5g. This process has 7245 di- 
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CPU / GPU(GTX280) 




gg 








V\ uu 




UD 











Number of Jets in Final State 



Fig. 2. Ratio of processing time. Time on CPU divided by 
time on GPU. 



agrams and 720 color basis vectors. In order to compile 
the program for the computation of this process, we use 
the technique developed in the previous study PQ. By di- 
viding the program into about 140 pieces we were able 
to compile the gg—>5g program. Compilation takes about 
90 min. on a Linux PC. The total size of the compiled pro- 
gram exceeds 200 MB, and we were not able to execute 
this compiled program on a GTX280. 



tween GPU and CPU are done for the multi-jet pro- 
duction processes up to 5 jets, excluding the purely 
gluonic subprocess. 

— Event process times of the GPU program on GTX280 
are more than 100 times faster than the CPU program 
for all the processes up to 3-jets, while the gain is re- 
duced to 60 for 4-jets with one or two quark lines, and 
to 30 for the purely gluonic process. It further goes 
down to 30 and 20 for 5-jet production processes with 
one and two quark lines, respectively. 

— We find that one cause of the rapid loss of GPU gain 
over CPU as the number of jets increases is the growth 
in the number of color bases. GPU programs slow down 
for processes with larger numbers of color basis vectors, 
while the performance of the CPU programs is not 
affected much. 

— All computations on the GPU were performed with 
single precision accuracy. A factor of 2.5 to 4 slower 
performance is found for double precision computation 
on the GPU. 
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6 Summary 

We have shown the results of our attempt to evaluate QCD 
multi-jet production processes at hadron colliders on a 
GPU [H] , Graphic Processing Unit, following the encour- 
aging results obtained for QED multi-photon production 
processes in ref. [TJ. 

Our achievements and findings may be summarized as 
follows. 

— A new set of HEGET functions written in CUDA [SJ, a 
C-language platform developed by NVIDIA for general 
purpose GPU computing, are introduced to compute 
triple and quartic gluon vertices. The HEGET routines 
for massless quarks were introduced in ref. [I], and 
the routine for photons [T] can be used for gluons. In 
addition, the HEGET functions for the qqg vertex are 
the same as those for the qqj vertex introduced in 
ref. [J. 

— The HELAS amplitude code generated by MadGraph [J] 
is converted to a CUDA program which calls HEGET 
functions for the following three type of subprocesses: 
gg — > ng (n < 5), uu — > ng (n < 5), and uu — > uu + ng 
(n<3). 

— Summation over color degrees of freedom was performed 
on a GPU by identifying the same valued elements of 
the color matrix of eq. ^ , in order to reduce the mem- 
ory size. 

— All the HEGET programs for up to 5 jets passed the 
CUDA compiler after division into small pieces. How- 
ever, we could not execute the program for the process 
gg—fbg. Accordingly, comparisons of performance be- 



Appendix A Additional HEGET functions 



In the appendix, we list the HEGET functions introduced 
in this report. They are for the ggg and gggg vertices 
which do not have counterparts in QED. Together with the 
HEGET functions listed in ref. pQ, the quark and gluon 
(photon) wave functions and the qqg{qq"f) vertices, all the 
QCD amplitudes can be computed on GPU. 



Appendix A.l Functions for the VVV vertex 



List 1. vvvxxx.cu 



#include "cmplx.h" 
device 

void vvvxxxCcmplx* ga, cmplx* gb, cmplx* gc, 
float g, cmplxfc vertex) 

cmplx vl2 = ga[o]*gb[o] 

- ga[i]*gb[i] - ga[2]*gb[2] - ga [3] *gb [3] ; 
cmplx v23 = gb [o] *gc [o] 

- gb[l]*gC[l] - gb[2]*gC[2] - gb[3]*gC[3]; 

cmplx v31 = gc[o]*ga[o] 

- gc[l]*ga[l] - gc[2]*ga[2] - gc[3]*ga[3] ; 

float pga[4] ; 
float pgb[4] ; 

float pgC [4] ; 

pga[o] = ga[4] .re; 
pga[l] = ga[5] .re; 
pga[2] = ga[5] .im; 
pga[3] = ga[4] .im; 

pgb [0] = gb [4] . re ; 

pgb[l] = gb[5] .re; 

pgb [2] = gb[5] .im; 

pgb [3] = gb[4] .im; 
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pgc[o] = gc[4] .re ; 

pgc[l] = gc[5] .re; 

pgc[2] = gc[5] .im; 

pgc[3] = gc[4] .im; 

cmplx pl2 = pga[o]*gb[o] 

- pga[l]*gb[l] - pga[2]*gb[2] - pga[3]*gb 

[3]; 

cmplx pl3 = pga[o]*gc[o] 

- pga[l]*gc[l] - pga[2]*gc[2] - pga[3]*gc 

[3] ; 

cmplx p21 = pgb[o]*ga[o] 

- pgb[l]*ga[l] - pgb[2]*ga[2] - pgb[3]*ga 

[3]; 

cmplx p23 = pgb[o]*gc[o] 

- pgb [1] *gc [l] - pgb [2] *gc [2] - pgb[3]*gc 

[3]; 

cmplx p31 = pgc[o]*ga[o] 

- pgc[l]*ga[l] - pgc[2]*ga[2] - pgc[3]*ga 

[3]; 

cmplx p32 = pgc[o]*gb[o] 

- pgc [l] *gb [l] - pgc[2]*gb[2] - pgc[3]*gb 

[3] ; 



vertex 



return; 



-Cvl2*(pl3-p23) 
v23*(p21-p31) 
v31*(p32-pl2))*g; 



Appendix A. 2 Functions for the WW vertex 



List 3. ggggxx.cu 



#include " cmplx. h" 
device 

void ggggxxCcmplx* ga, cmplx* gb, cmplx* gc, 
cmplx* gd, float gg, cmplx& vertex) 

cmplx gad = ga[o]*gd[o] 

-ga [l] *gd [l] -ga [2] *gd [2] -ga [3] *gd [3] ; 

cmplx gbc = gb [o] *gc [o] 

-gb [1] *gC [1] -gb [2] *gC [2] -gb [3] *gC [3] ; 

cmplx gac = ga[o]*gc[o] 

-ga [I] *gc [l] -ga [2] *gc [2] -ga [3] *gc [3] ; 

cmplx gbd = gb[o]*gd[o] 

-gb [l] *gd [l] -gb [2] *gd [2] -gb [3] *gd [3] ; 

vertex = gg*(gad*gbc-gac*gbd) ; 

return; 



List 4. jgggxO.cu 



List 2. jvvxxO.cu 



#include "cmplx. h" 
device 

void jggxxxCcmplx* ga, cmplx* gb, float g, 
cmplx* jvv) 

{ 

jvv [4] = ga[4] + gb[4] ; 
JVV [5] = ga[5] + gb[5] ; 

float pi [4] ; 
float p2[4] ; 
float q[4] ; 

pl[0] = (ga[4] .re); 

pl[l] = (ga[5] .re); 

pi [2] = (ga[5] .im); 

pi [3] = (ga[4] .im); 

p2[0] = (gb[4].re); 

p2[l] = (gb[5].re); 

p2[2] = (gb[5].im); 

p2[3] = (gb[4].im); 

q[o] = -(jvv [4] .re) 
q[l] = -Cjvv[5] .re) 
q[2] = -(jvv [5] .im) 
q[3] = -(jvv [4] .im) 

float s = q[0]*q[0]-q[l]*q[l]-q[2]*q[2]-q[3]*q 
[3] ; 

cmplx gab = ga[o]*gb[o] 

- ga[l]*gb[l] - ga[2]*gb[2] - ga[3]*gb[3]; 

cmplx sga = (p2[0]-q[0])*ga[0] - (p2[l]-q[l])* 
ga[l] 

- (p2[2]-q[2])*ga[2] - (p2[3]-q[3])*ga[3] ; 

cmplx sgb = -(pl[0]-q[0])*gb[0] + (pl[l]-q[l])* 
gb[l] 

+ (pl[2]-q[2])*gb[2] + (pl[3]-q[3])*gb[3] ; 
float gs = -g*(l.of/s); 

jvv[o] = gs*((pl[o]-p2[o])*gab 

+ sga*gb[o] + sgb*ga[o]) 
jvv[l] = gs*((pl[l]-p2[l])*gab 

+ sga*gb[l] + sgb*ga[l]) 
jvv[2] = gs*((pl[2]-p2[2])*gab 

+ sga*gb[2] + sgb*ga[2]) 
jvv[3] = gs*((pl[3]-p2[3])*gab 

+ sga*gb[3] + sgb*ga[3]) 

return; 



#include " cmplx. h" 
device 

void jgggxOCcmplx* ga, cmplx* gb, cmplx* gc, 
flc ' 

{ 



Loat gg, cmplx* jggg) 



jggg[4] = ga[4]+gb[4]+gC[4] ; 
Jggg [51 = ga[5]+gb[5]+gC[5] ; 

float q[4] ; 
q[0] = -jggg [4] .re 
qM = "Jggg [5] re 
q[2] = -Jggg [5] .im 
q[3] = "JgggM.im 

float fact = gg*(l.of/(q[o]*q[o] 

-q[l] *q [1] -q[2] *q[2] -q[3] *q 
[3])) ; 

cmplx gcb = gc [o] *gb [o] 

-gC [1] *gb [1] -gC [2] *gb [2] -gC [3] *gb [3] ; 

cmplx gac = ga[o]*gc[o] 

-ga ri] *gc [1] -ga [2] *gc [2] -ga [3] *gc [3] ; 

jggg[o] = fact*( ga[o]*gcb - gb[o]*gac ) 
jggg[l] = fact*( ga[l]*gcb - gb[l]*gac ) 
Jggg[2] = fact*( ga[2]*gcb - gb[2]*gac ) 
Jggg[3] = fact*( ga[3]*gcb - gb[3]*gac ) 



return; 
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